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Abstract 

In this article, we propose a penalized clustering method for large scale data 
with multiple covariates through a functional data approach. In the proposed 
method, responses and covariates are linked together through nonparametric 
multivariate functions (fixed effects), which have great flexibility in modeling 
a variety of function features, such as jump points, branching, and periodicity. 
Functional ANOVA is employed to further decompose multivariate functions in 
a reproducing kernel Hilbert space and provide associated notions of main effect 
and interaction. Parsimonious random effects are used to capture various corre- 
lation structures. The mixed-effect models are nested under a general mixture 
model, in which the heterogeneity of functional data is characterized. We pro- 
pose a penalized Henderson's likelihood approach for model-fitting and design 
a rejection-controlled EM algorithm for the estimation. Our method selects 
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smoothing parameters through generalized cross-validation. Furthermore, the 
Bayesian confidence intervals are used to measure the clustering uncertainty. 
Simulation studies and real-data examples are presented to investigate the em- 
pirical performance of the proposed method. Open-source code is available in 
the R package MFDA. 

Key words: Clustering, Functional Data Analysis, Mixed-Effect Model, 
Smoothing Spline, EM Algorithm. 

Running title: Penalized clustering of functional data. 



1 Introduction 

With the rapid advancement in high throughput technology, extensive repeated mea- 
surements have been taken to monitor the system- wide dynamics in many scientific 
investigations. A typical example is temporal gene expression studies, in which a series 
of micorarray experiments are conducted sequentially during a biological process, e.g., 



cell cycle microarray experiments (Spellman et al. 1998). At each time point, mRNA 



expression levels of thousands of genes are measured simultaneously. Collected over 
time, a gene's "temporal expression profile" gives the scientist some clues on what 
role this gene might play during the process. A group of genes with similar profiles are 
often "co-regulated" or participants of a common and important biological function. 
Thus clustering genes into homogeneous groups is a crucial first step to decipher the 
underlying mechanism. The need to account for intrinsic temporal dependency of 
repeated observations within the same individual renders traditional methods such as 
K-means and hierarchical clustering inadequate. By casting repeated observations as 
multivariate data with certain correlation structure, one ignores the time interval and 
time order of sampling. Additionally, missing observations in the measurements yield 
an unbalanced design, which requires imputation beforehand for application of mui- 
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tivariate approaches, e.g., the multivariate Gaussian clustering method (MCLUST, 



Fraley and Raftery 1990). 



In addition to the time factor, such repeated measurements often contain other 
covariates, e.g., replicates at each time point, species in comparative genomics studies 
( IMcCarroll et al. 20040 . and treatment groups in case-control studies ( Storey et al. 2005 ), 
as well as many factors in a factorial designed experiment. Incorporation of multi- 
ple covariates adds another layer of complexity. Clustering methods taking all these 
factors into account are still lacking. 

Recently, nonparametric analysis of data in the form of curves, i.e. functional 



data, is subject to active research. See Ramsay and Silverman (|2005|, 2002) for a com- 
prehensive treatment of functional data analysis. A curve-based clustering method 



(FCM) was introduced in James and Sugar (2003) to cluster sparsely sampled func- 
tional data. Similar approaches were developed in ILuan and LH (120031 120041) and 



Heard et al. (2006) to analyze temporal gene expression data. Although these meth- 



ods model the time factor explicitly, none of them are designed to accommodate 
additional factors. Moreover, smoothing- related parameters, e.g., knots and degrees 
of freedom, in these methods are the same across all clusters and must be specified a 
priori. Consequently, they can not model drastically different patterns among differ- 
ent clusters, which leads to high false classification rate. Finally, the computational 
costs of these methods are very high for large scale data. 

Motivated by analysis of temporal gene expression data, we propose a flexible 
functional data clustering method that overcomes the aforementioned obstacles. In 
our proposed method, responses and covariates are linked together through nonpara- 
metric multivariate functions (fixed effects), which have great flexibility in modeling 
a variety of function features, such as jump points, branching, and periodicity. Func- 
tional ANOVA is employed to further decompose multivariate functions (fixed effects) 
in a reproducing kernel Hilbert space and provide associated notions of main effect 
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and interaction (IWahba 19901 and IGu 20021) . Parsimonious random effects, comple- 
menting the fixed effects, are used to capture various correlation structures. The 
mixed-effect models are nested under a general mixture model, in which the hetero- 
geneity of the functional data is characterized. We propose a penalized Henderson's 
likelihood approach for model-fitting and design a rejection-controlled EM algorithm 
for estimation. In this EM algorithm, the E-step is followed by a rejection-controlled 
sampling step (ILiu et al. 19 98) to eliminate a significant number of functional obser- 
vations, whose posterior probabilities of belonging to a particular cluster is negligible, 
from calculation in the subsequent M-step. The M-step is decomposed into the si- 
multaneous maximization of penalized weighted least squares in each cluster. The 
smoothing parameters associated with the penalty are selected by generalized cross- 
validation, which can be shown to track a squared error loss asymptotically. Our 
method is thus data-adaptive and automatically captures some important functional 
fluctuations. For model selection, we employ BIC to select the number of clusters. 
Moreover, the proposed method not only provides subject-to-cluster assignment but 
also the estimated mean function and associated Bayesian confidence intervals for 
each cluster. The Bayesian confidence intervals are used to measure the clustering 
uncertainty. These nice features make the proposed method extremely powerful for 
clustering large scale functional data. 

The remainder of the article is organized as follows. In Section 2, we present 
a nonparametric mixed-effect model representation for functional data. A mixture 
model for clustering is considered in Section 3. Simulation and real data analysis 
follow in Section 4 and 5. A few remarks in Section 6 conclude the article. Proofs of 
the theorems are collected in Appendix. 
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2 Nonparametric Mixed- Effect Representation of 
Homogeneous Functional Data 



Assuming the data are homogeneous, i.e., the number of clusters is one, we shall 
present a mixed-effect representation of functional observations. 

2.1 The Model Specification 

We assume the functional data of the ith individual, y { = (yn, • ■ • , 2/mJ T , follows the 
mixed-effect model, 

= fj,(xi) + Zibj + £4, (2.1) 

where the population mean \x is assumed to be a smooth function defined on a generic 
domain T, Xj = (xn, . . . , £j„J T is an ordered set of sampling points, bj ~ N(0, B) is 
a p x 1 random effect vector associated with a rij x p design matrix Z,, and random 
errors e\ ~ N(0, cr 2 I) are independent of bj's, and of each other. The random effect 
covariance matrix B and random error variance a 2 are to be estimated from the 
data. Model (12.11) has been extensively studied in the statistical literature. See 



Wang (1998), Zhang et al. (1998), Gu and Ma (2005), and references therein. 



For multivariate x where x = (im, x^s, • ■ ■ , X(d)) T , each entry takes values in 
some fairly general domain 1^, i.e., T = ®fc =1 Tfc. Some examples are 

Example 2.1 T = [0, T] x {1, • • • , c} to model temporal variation from time to 
time T under multiple conditions; F = Circle x {1, ■ • • , s} to model periodicity of a 
biological process of multiple species. 

The functional ANOVA decomposition of a multivariate function \i is 

d d d 

fi(x) = no + ^2^j(x {j) ) Yl Vjk{x{j),x {k )) + ■ ■ ■ + Hi,-,d( x {i)r ■ ■ ,X(d)) (2.2) 

j=l j=l k=j+l 



where /i is a constant, the main effects, /i^'s are the two-way interactions, 

and so on. The identifiability of the terms in (I2.2p is assured by side conditions 



through averaging operators. See Wahba (1990) and Gu (2002) 



By using different specifications of the random effect b, and associated design 
matrix Zj, model (12.1 1) can accommodate various correlation structures. 

Example 2.2 If we let p — 1, i.e., bj is a scalar, B = a\ and Zj = 1, we have the 



same correlation across time. If we let p = 2, i.e. bj = (bn, 6j2) T , B 



(T b 1 b 2 °b- 2 



and Zj = (l,Xj), the difference between the ith subject profile and the mean profile 
is a linear function in time. The covariance between expression values at X\ and x 2 
for the same individual is <j\ x + (xi + X2)a\ xh ^ + XiX 2 cr b 2 2 - 



2.2 Estimation 

Model (12.11) is estimated using penalized least squares through the minimization of 

n n 

J2iVi - Mxi) - Z^fiVi - /i(x.) - Z,b,) + ^'b^B- 1 ^ + NXM(fi), (2.3) 

i=l i=l 

for iV = Y2i n ii where the first term measures the fidelity of the model to the data, 
M(/i) = M(n, n) is a quadratic functional that quantifies the roughness of fi, and 
A is the smoothing parameter that controls the trade-off between the goodness-of-fit 
and the smoothness of /i. (12.31) is also referred to as penalized Henderson's likelihood 
since the first two terms are proportional to the joint density (Henderson's likelihood) 
of (ViM) (IRobinson 1991 jl . 

To minimize (12. 31) . we only need to consider smooth functions in the space {fi : 
M(fi) < oo} or subspace therein. As a abstract generalization of the vector spaces 
used extensively in multivariate analysis, Hilbert spaces inherit many nice properties 
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of the vector spaces. However, the Hilbert space is too loose to use for functional 
data analysis since even the evaluation functional [#](/) = f{x), the simplest func- 
tional one may encounter, is not guaranteed to be continuous in a general Hilbert 
space. An example is that in the Hilbert space of square integrable functions de- 
fined on [0,1], evaluation is not even well defined. Consequently, one may focus 
on a constrained Hilbert space for which the evaluation functional is continuous. 
Such a Hilbert space is referred to as a reproducing kernel Hilbert space (RKHS), 



for which Ramsay and Silverman (2005) suggested a nickname: continuous Hilbert 
space. For example, the space of functions with square integrable second derivatives 
is an RKHS if it is equipped with appropriate inner products (IGu 20020 . For the 
evaluation functional [#](•)> by the Riesz representation theorem, there exists a non- 
negative definite bivariate function R(x,y), the reproducing kernel, which satisfies 
(R(x, •), /(•)) = f(x), called the "representer" of [#](•), in RKHS. Given an RKHS, 
we may derive the reproducing kernel from the Green's function associated with the 
quadratic functional M(fi). Since the construction of reproducing kernel is beyond 



the scope of this article, readers may refer to Wahba (1990) and Gu (2002) for details. 

The minimization of (12. 3p is performed in a reproducing kernel Hilbert space 
7i C {/i : M(/i) < oo} in which M(/i) is a square semi norm. To incorporate (12.21) in 
estimating multivariate functions, we consider fij G l~t(j), where 7i(j) is a reproducing 
kernel Hilbert space with tensor sum decomposition TC(j) = / Ho<j)®'Hi{j) where TCo{j) is 
the finite-dimensional "parametric" subspace consisting of parametric functions, and 
Ti-i(j) is the "nonparametric" subspace consisting of smooth functions. The induced 
tensor product space is 

n = ®Ui n u) = ®s[{^jesn m ) ® (®^Woo->)] = ®sH s , 

where the summation runs over all subsets S C {!,-•• , d}. These subspaces TCs form 
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two large subspaces: Mm — {v '■ M(/i) = 0}, which is the null space of M(/x), and 
TCqMm with the reproducing kernel Rm(-, •)■ The solution of (I2.3p has an expression 

m T 

/i(x) = ^ d v (j) v {x) + J ^2c i R M (s i ,x), (2.4) 

U=l 8=1 

where {(fi u }™ = i is a basis of A/"m, and and q are the coefficients, s = (s 1; • • • , s T ) is 
a distinct combination of all Xijii — 1, • • • , n, j = 1, • • • , n.j). 

Example 2.3 Consider the temporal variation under a treatments. Take the fixed 
effect as fi(t, r), where r G {1, . . . , a] denotes the treatment levels. One may decom- 
pose 

fi(t, r) = m + nx(t) + A* 2 (r) + H\#(t, r), 

where \i% is a constant, /ii(t) is a function of t satisfying yUi(O) = 0, /^(t) is a 
function of r satisfying J2°r=i / i 2( r ) = 0, and Hi^it, r) satisfies ^1,2(0, r) = 0, Vr, and 
X^t=i / i i,2(^ 5 r ) — 0, VI The term \i$ + /ii(t) is the "average variation" and the term 
^2(7") + n±,2(t,r) is the "contrast variation". 
For flexible models, one may use 



(2.5) 



pT pT « 

M(/x) = ^r 1 / {d 2 ^/dt 2 fdt + e^ 2 V(dVi,2M 2 ) 2 ^, 

Jo ' Jo T=1 

which has a null space Mm of dimension 2a. A set of 4> u are given by 

{l,t,I {j} (r) - 1/a, (I{j}(r) - l/a)t,j = 1, . . . ,a - 1}, 

and the function i?M is given by 

RM(ti,T 1 ;t 2 ,T 2 ) = 61 / (ti-u) + (t 2 -u) + du+6i :2 (I{ T1 }(r2)-l/a) / (ti-w) + (t 2 -w)+c?w 



S 



See, e.g.. IGul (120021 §2.4.4). To force an additive model 



fx(t, t) = m + ni(t) + // 2 (t), 



(2.6) 



which yields parallel curves at different treatments, one may set = and remove 
(7{j}(r) — l/a)t from the list of 4> u . 

Substituting (J23H int o Q, we have 

(y - Sd - Rc - Zb) T (y - Sd - Rc - Zb) + b T ftb + iVAc T Qc, (2.7) 



T\T 
n I i 



where y = (yj 7 • • • , y^) T , d = (d h • • • , gQ t , c = {c u ■ ■ ■ , c T ) T , b = (bf , • • • , b 
S = (Sf , • • ■ , S^) T with the (fc, z/)th entry of the rii x m matrix Sj equal to 4> u (t ik ), 
R = (Rf , • ■ ■ , R^) T with the (/, j)th entry of the xT matrix R, equal to Ruitu, Sj), 
the design matrix Z = diag(Z x , • • • , Z n ), f& = a 2 diag(B _1 , • • • , B -1 ) and Q is T x T 
matrix with the (J, k)th entry equal to RM^Sj, s&). 

Differentiating (12 .7p with respect to d, c and b and setting the derivatives to 0, 
one has 



/S T S S T R S T Z \ /d\ 

R T S R T R + (iVA)Q R T Z c 
\Z T S Z T R Z T Z + QJ \b) 



(s T y\ 
n T y 

\ zT y) 



(2.8) 



The system (I2.8P can be solved through the pivoted Cholesky decomposition followed 



by backward and forward substitutions. See, e.g., Kim and Gu (2004) for details. 



The fitted values y = Sd + Rc + Zb of (12.31) can be written as y = A(A, fl)y, 
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where A(A, O) is the smoothing matrix given below, 



A(A,0) = (S,R,Z) 



/S T S S T R S T Z \ 

R T S R T R + (iVA)Q R r Z 
K Z T S Z T R Z T Z + Qj 



fs T \ 

R T 
\7?) 



(2.9) 



and C + denotes the Moore-Penrose inverse of C satisfying CC + C = C, C + CC + = 
C+ (CC+) T = CC+ and (C+C) T = C+C. 

With varying smoothing parameters A (including 9) and correlation parameters f2, 
(12.81) defines an array of possible estimates, in which we need to choose a specific one 
in practice. A classic data-driven approach for selecting the smoothing parameter A is 



generalized cross-validation (GCV), which was proposed in Craven and Wahba (1979) 



Treating the correlation parameters fl as extra smoothing parameters, we adopt the 



approach of Gu and Ma (2005) to estimate A and the correlation parameters ft si- 



multaneously through minimizing the GCV score 

iV-V(I-A(A,tt)) 2 y 



v(\,n) 



{N-hr(i- A(\,n))y 



(2.10) 



Since the GCV score V^(A, f2) is non-quadratic in A and Q , one may employ 
standard nonlinear optimization algorithms to minimize the GCV as a function of the 
tuning parameters. In particular, we used the modified Newton algorithm developed 



by Dennis and Schnabel (1996) to find the minimizer. The distinguishing feature of 



generalized cross-validation is that its asymptotic optimality can be justified in a 
decision-theoretic framework. One may define a quadratic loss function as, 



1 n 

l{\ n) = - Y&i - 

i=i 
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Under general conditions, Gu and Ma (2005) showed that the GCV tracks the loss 
function asymptotically, 

1 - 

v(x, ft) - n)--Y, = o P (L(\, ft)). 



8=1 



Note that does not depend on A and O. It then follows that the minimizer of 
the GCV score V(A, ft) approximately minimizes the loss function L(X, ft). 

2.3 Bayesian Confidence Intervals 

Unlike confidence estimates in parametric models, a rigorously justified interval esti- 
mate is a rarity for nonparametric functional estimation. An exception is the Bayesian 



confidence interval developed by Wahba (1983) from a Bayes model. A nice feature 



of Bayesian confidence intervals is that they have a certain across-the-function cover- 



age property. See Nychka (1988) In this section, we derive the posterior mean and 
variance for constructing Bayesian confidence intervals in our setting. 

The regularization is equivalent to imposing a prior on the functional form of 
To see this,we decompose /i = f + fi, where f has a diffuse prior in the space Mm 
and fi has an independent Gaussian process prior with mean zero and covariance, 

2 

#[/i(sfc)/i(*i)] = J^x R M( s k> s T )Q + i? M (s, Sl ). (2.11) 

The minimizer of ( 12.31) can be shown to be the posterior mean under the above prior 
by the following theorem. 

Theorem 2.1 With the prior for \x specified above and a generic npxl vector z, the 



11 



posterior mean of fi(x) + z T b has the following expression: 



E[/i{x) + z T b\y] = T d + ft + z T b, (2.12) 

where <f> is m x 1 with the vth entry 4>v(x), $, is T x 1 with the ith entry R(si, x), d, 
c, and b are the solutions of Ii2.8\) . 

The posterior variance is given in the following theorem. 

Theorem 2.2 Under the model specified in Theorem \2.1[ the posterior variance has 
the following expression: 

A A Var[n(x) + z T b\y] = £ T Q+£ + N\z T Vt + z + (j) T (S T W- l S)- l (j) 



a 2 



- 20 T (S T W- 1 S)- 1 S T W- 1 RQ+£ - 2iVA0 T (S T W- 1 S)- 1 S T W- 1 Zr2+z 
-(^ T Q + R T +ATAz T ri + Z)(W- 1 -W- 1 S(S T W- 1 S)- 1 S T W- 1 )(RQ + |+^AZri + z), 

where W = RQ+R T + NXZn + Z T + NXl. 

The proofs of the above two theorems are given in Appendix. Using Theorem 
12.11 and Theorem 12. 2\ we construct the 100(1 — a)% Bayesian confidence intervals 



as, E[fi(x) + z T b\y] ± $(1 - a/2)" 1 y/Vax\p,(x) + z T b\y], where $(1 - a/2)" 1 is the 
100(1 — a/2) percentile of the standard Gaussian distribution. Letting z = 0, we get 
Bayesian confidence intervals for fi(x). Note that the construction of Bayesian con- 
fidence intervals is pointwise. It is unclear whether the across-the-function coverage 



property of Nychka (1988) holds in our case. 
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3 The Mixture Model 



Based on the mixed-effect representation of homogeneous functional data, we shall 
now present a mixture model for characterizing the heterogeneity. 

3.1 The Model Specification 

When the population is heterogeneous, we assume that the ith functional observation 
can be modeled as 

Hi = /Xfc(xi) + Z;b; + €i with probability p k (3.1) 

where k = 1, • • • ,K, the fcth cluster's mean is a smooth function defined on a 
generic domain T, bj ~ N(0, B fc ) is a p x 1 random effect vector associated with a 
Hi x p design matrix Zj, €j ~ N(0,a 2 I) are random errors independent of the bj's 
and of each other, cluster probabilities p k satisfy J2k =1 Pk = 1, and K is the number 
of clusters in the population. 

To ease the computation, we introduce a "latent" membership labeling variable 
Jik such that Jik = 1 indicates individual i belongs to the fcth cluster and Jn~ = 
otherwise. Thus we have the probability that = 1 is pk- The mixture Henderson's 
likelihood is seen to be 

n K 

^log^frkfyiVi'M, J ik = l)f b (hi]J ik = 1)] 

i=l k=l 

where f y and fb are probability density functions for y { and bj respectively. 
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3.2 Estimation 

The negative penalized Henderson's likelihood of complete data (y i7 J^) where i 
1, • • • , n, is seen to be 



n K 

L c = Constant -/./.Jik log Pk 



i=l k=l 

n K K 



+^ Yl Yl ^fc[(?/i-^(x i )-Z l b J ) T (?/ i -/i fe (x i )-Z 4 b i )+a 2 bf B-^+Y NX k M(fi k ) 

(3.2) 



2a 2 

i=l k=l k=l 



where X k is the smoothing parameter for fi k . 

Once the penalized Henderson's likelihood (13.21) is obtained, the EM algorithm 
(Dempster et al. 1977 IGreen 199011 can be derived as follows. 



The E-step simply requires the calculation of 

Pk<f(yi,^k(^i),^k) 
Wik = — x (3-3) 

where = ZjB^Zf + <r 2 I, and tp is the Gaussian density function. 

The M-step requires the conditional minimization of the following equation 

K n 



YY Wik log Pk ( 3 ' 4 ) 

k=l 1=1 

^ n K K 

Y^YY Wik ^yi ~ M x - ^ihik) T (yi - A**(xi) - Zibi fc ) + o^b^B^bifc] + ^ NX k M(/j, k ) 



i=l k=l k=l 

(3.5) 

where h ik is bj given the membership J ik . Thus the M-step is equivalent to minimizing 
(13.41) and (13. 5p separately. 
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By minimizing (13.41) . we have 

1 - 

Pk = ~y^w ik for k = l,---,K. (3.6) 

i=l 

By minimizing (13.51) . we can minimize the following K equations simultaneously 

■n 

^2 Wik[(yi-Vk(xi)-Zibik) T (yi-Vk(xi)-Zibi k )+a^ k = 1, • • • , K. 

i=i 

(3.7) 

where 1 /2a 2 is absorbed into X k . The minimization of (13.71) is performed in the 
reproducing kernel Hilbert space Ti C {r/ : M{jj) < oo}. Substituting solution (12.41) 
into (13.71) . we have 

(y - Sd k - Rc k - Zb k ) T W k (y - Sd k - Rc k -Zh k ) + b£w' /2 ft fe W' /2 b fe + iVA fc cjQc fc , 

(3.8) 

where y = (yf , ■ ■ • , yl) T , d k = (d lk , ■ • • , d mk ) T , c k = (c lk , • ■ ■ , c Tk ) T , b k = (bf fc , • • ■ , b^,) T , 
S = (Sf , • • • , S^) T with the (A;, u)th entry of the x m matrix equal to 4> u (t ik ), 
R = (Rf , • • • , R^) T with the (/, j)th entry of the n« xT matrix R« equal to Rm{Ui-, 
the design matrix Z = diag(Zi, • - • ,Z n ), W fc = diag(wi fc I ni , • ■ ■ ,w nk l nn ), W k = 
diag(u> lfc I p , • • • , w nk I p ), fl k = (j 2 diag(B A r 1 , • • • , B^ 1 ) and Q is the T x T matrix with 
the (J, k)th entry equal to Rm(sj, s k ). 

Writing (13.81) in a more compact form, we have 

{Vwk~ S^fcdfc— R wk c k — Zh wk ) T (y wk — S^d^ — H wk c k — Zb m / c )+b^ fc fifcb UI fc+A^AfccjQcfc, 

(3.9) 

where y wk = w£ /2 y, S wk = W^S, K wk = W^R, Z wk = W,f ZW~ 1/2 , and 
b wk = W L k ^b k . Then (ED can be minimized using the techniques developed in 
Section 2. 
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The variance of measurement error is estimated as 

^ n K 

b 2 = j-j: ^^WikiUi - /ife(xi) - Z i b ik ) T (y i - /ifc(xi) - Zibjfc). (3.10) 

i=l k=l 

The algorithm iterates through ( 13.3[) . ( 13.6f) . ( 13. 91) and ( 13. 10[) until all the parame- 
ters converge. 

The selection of the smoothing parameters and plays an important role 
in the proposed algorithm. When first run our algorithm, in each iteration, the 
optimal smoothing parameters are selected for each cluster using GCV in (13.91) . Once 
all parameters converge, we fixed the selected smoothing parameters and run our 
algorithm for fixed smoothing parameters. 

After we fit the mixture model to the data, we can give a probabilistic (soft) clus- 
tering of each observation That is, for each y { , wn, ■ ■ ■ ,WiK give the estimated 
probabilities that this observation belongs to the first, second,..., and Kth compo- 
nents, respectively, of the mixture. However, in many practical settings, it is highly 
desirable to give hard clustering of these observations by assigning each observation to 
one component of the mixture. In the rest of the paper, we adopt the hard clustering 



of McLachlan and Peel (2001) by estimating the membership label, 



Jik 



1 if k — aigma,x h Wih 
otherwise 



where k = 1, • • • , K and i = 1, 



3.3 Efficient Computation with Rejection Control 

With thousands of observations under consideration, the E-step (13.31) results in a 
huge number of w^'s, many of which are extremely small. With the presence of these 
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small u>jfc's, the calculation of matrices involved in the M-step (13. 9p is expensive, 
unstable and sometimes even infeasible. To alleviate the computation and stabilize 
the algorithm, we propose to add a rejection control step (ILiu et al. 19981) in the EM 
algorithm and refer to the modified algorithm as rejection controlled EM algorithm. 

Firstly, we set up a threshold value c (e.g., c = 0.05). Given this threshold value, 
we introduce the following rejection controlled step: 



The resulting w* k needs to be normalized: w*£ = w* k / J2k w lk- Then we replace Wik 
by w*l right after the E-step (13.31) . Note that when c = 0, the proposed algorithm 
is exactly the original EM algorithm, whereas the proposed algorithm reduces to a 
variant of Monte Carlo EM algorithm (IWei and Tanner 19901) when c = 1. In this 
way, it is possible to make accurate approximations during the E-step while greatly 
reducing the computation of the M-step. 

Finally, in order to avoid local optima, the rejection controlled EM is run with 
multiple chains. In practice, we first set the threshold c close to 1 at an early stage 
of the iterations to expedite the calculation, then we gradually lower c so that the 
algorithm can achieve a better approximation of the original EM. 

A critical issue arising from the new algorithm is how to choose an appropriate 
stopping rule. For the original EM algorithm, the likelihood function increases after 
each iteration, so we can stop the iteration when the likelihood does not change. 
However, for the rejection controlled EM algorithm, the likelihood functions fluctuates 
because of the sampling scheme. So a stopping rule like those used in the Gibbs 
sampler is employed. When the likelihood function is no longer increasing for several 




w ik if Wik > c 

c with probability Wik/c if 

with probability 1 — Wik/c 
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consecutive iterations, we stop and choose the estimates with the highest likelihood. 

3.4 The Selection of the Number of Clusters 

The success of our proposed methods heavily depends on the selection of the num- 
ber of clusters K. A natural choice in model-based clustering is to use the Bayesian 
Information Criterion (BIC). The BIC imposes a penalty on the total number of pa- 
rameters, scaled by the logarithm of sample size, so as to strike a balance between 
the goodness-of-fit and the model complexity. A critical issue in using BIC in non- 
parametric settings is to determine the effective number of parameters. Here we use 
the trace of the smoothing matrix to approximate the number of parameters in each 
cluster (IHastie and Tibshirani 1990} IGu 2002]) . Thus BIC under our model is 

n K K 

bic =-2j2 fos^pMvi, M**), £*) + (J2 tlA k( x k, n*) + p) log at, (3.11) 

i=l k=l k=l 

where is the smoothing matrix for the kth cluster as defined in (12.91) . P is the 
number of free parameters in pk, and flk where k — 1, • • • , K. 

4 Simulation 

To assess the performance of the proposed method, we carried out extensive analysis 
on simulated datasets. 

This simulation is designed to demonstrate the performance of the proposed 
method when the underlying clusters' mean functions are different for different clus- 
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ters. First, one hundred replicates of samples were generated according to 



y lijT = 3 sin(67rtj)(l - tj) + 2I {1} (t) - 1 + e lijT , i = 1, • • • , 30; 
y 2 i jT = 3 sin(67r^)(l - tj) + e 2ijT1 z = 1, • • • ,40; 
y 3ii r = 198(Mj(l - t,) 3 + 858^(1 - tj ) 10 - 2 + e 3ijT , i = 1, • • • , 50; 
Viijr = 3 sin(27rtj) + 21 {1} (r) - 1 + e 4iir , z = 1, • • • ,30; 

where tj = 1/15, 2/15, • • • , 1, r = 0, 1, indicator function I{i}(r) = 1 if r = 1 and 
otherwise, random errors e were generated from a Gaussian distribution with mean 
zero and covariance matrix as follows: 

VaT[e hjT ] = 1, Cov(euj T1 ,eu kV2 ) = 0.2, for I = 1,3; 
Var[e KiT ] = 1.2, Cov(e UjT1 ,ei ikT2 ) = 0.4, for I = 2,4; 

We analyzed the simulated data using the proposed method with the following 
mixture model 

y i = /i fc (t, r) + M + 6» with probability 

where k = 1, ■ ■ ■ , K, t = 1,2 for two groups, 6j ~ N(0, a%) is the individual specific 
random effect. The important feature of the simulated data is that the true mean 
curves in two groups, indexed by r, are either identical or parallel. This information 
was built into our method through enforcing the additive model (12. 6p . The penalized 
Henderson's likelihood was employed for estimation with roughness penalty M(fi) = 
fi(d 2 in/dt 2 ) 2 dt. 



We compared our method with MCLUST (Fraley and Raftery 1990), FCM classi- 



fication likelihood (FCMc), and FCM mixture likelihood (FCMm) (James and Sugar 2003) 
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Figure 4.1: The estimated mean curves (dash lines) and 95% Bayesian confidence 
intervals for one simulated dataset. The true functions are superimposed as solid 
lines. 



Since the number of clusters must be specified a priori in the partially implemented 
FCM software, we gave a significant starting advantage to the FCM algorithm by 
letting the number of clusters be the true number of clusters (four). For MCLUST, 
the clustering result with optimal BIC was reported, which was estimated from eight 
models with different covariance structures. The estimated mean curves using the 
proposed method for each cluster and the true curves of one sample are plotted in 
Figure 14.11 

For comparison, we need a measure of the agreement of the clustering results with 
the true cluster membership. A popular one is the Rand index, which is the percentage 



of concordance pairs over all possible data pairs. Hubert and Arabie (1985) proposed 
an adjusted Rand index, which takes one as the maximum value when two clustering 
results are the same and the expected value is equal to zero when two clustering 
results are independent. We found that across 100 samples the average of the adjusted 
Rand indices for the proposed method is 0.9676 (median is 0.9838), whereas those of 
MCLUST, FCMm, FCMc are 0.7553, 0.8936, and 0.8896, respectively. Moreover, the 
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inter-quartile range of the adjusted Rand indices of the proposed method is 0.0565, 
(0.0972, 0.2189 and 0.2262 for MCLUST, FCMm, and FCMc respectively). These 
results suggest that the proposed method outperforms FCMc and FCMm (even under 
the ideal scenario where the true number of clusters is provided to FCMc and FCMm 
a priori) as well as MCLUST. 



5 Real Data Examples 

5.1 Comparative Genomic Study of Fruitfly and Worm Gene 
Expressions 

Development is an important biological process that shares many common features 
among different organisms. It is well-known that D. melanogaster (fruitfly) and C. 
elegance (worm) are two highly diverged species, the last common ancestor of which 
existed about one billion years ago. Their development is an active research area: 



In Arbeitman et al. (2002) , the mRNA levels of 4028 genes in D. melanogaster were 
measured using cDNA microarrays during 62 time points starting at fertilization and 
spanning embryonic, larval, pupal (metamophosis) stages and the first 30 days of 
adulthood. mRNA was extracted from mixed male and female populations until 
adulthood when males and females were sampled separately. Jiang et al. (2001) re- 
ported a cDNA microarray experiment for 17871 genes over the life-cycle of C. elegans 
at 6 time points, including eggs, larval stages: LI, L2, L3 and L4, and young adults. 
To study the genomic connections in expression patterns across the two species, we 



combined the gene expression datasets of Arbeitman et al. (2002) and Jiang et al. (2001) 



using the orthologous genes provided by McCarroll et al. (2004) , which resulted in a 



merged expression dataset containing 808 orthologous genes. We analyzed the data 
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using the proposed method with the mixture model, 



y i = /i fc (t,r) + bil + Si 

with probability p k where k = 1,---,K, r = 1 for fruitfly and r = 2 for worm, 
hi ~ iV(0, of) is the gene specific random effect. The penalized Henderson's likelihood 
was employed with roughness penalty M of the form (12. 5p . Sex differentiation of the 
fruitfly was modeled by a branching spline ([Silverman and Wood 1987|) . the general 



analytic form of which with two branches on the right is 

{XT=i d u 4>v{t) + Ya=i CiRuisi, t) if t < s k 
J2™=1 du4>v{t) + Yh=1 CiR M {Si, t) + YA=k+l CliRM(Si - Sfc, t - s k ) if t> s k 
J2™=i d v (j) v (t) + Y!l=i CiR M {sh t) + Yn=k+i c 2i R M{si - s k ,t- s k ) if t > s k 

where s k is the branching point, and the second and third rows are expressions of 
the two branches. A cubic smoothing spline was used. The 808 genes were clus- 
tered by our method into 34 clusters. Biological functions of genes in each cluster 
were annotated using Gene Ontology, and Bonferroni corrected P-values of biolog- 
ical function enrichment were calculated based on the hypergeometric distribution 
( ICastillo-Davis and Hartr"20 03). Of the 34 clusters discovered, 21 clusters exhibit 
significant biological functions over-representation (P- value < 0.05). The estimated 
mean gene expression curves of three clusters and their 95% Bayesian confidence 
intervals are given in Figure 15.11 

In cluster A, which consists of 31 genes, gene expressions of worms have peaks 
at eggs, larva and young adult. In the same cluster, we observed that fruit-fly gene 
expressions that are up-regulated during embryogenesis are also up-regulated during 
metamorphosis, suggesting that many genes used for pattern formation during em- 
bryogenesis (the transition from egg to larva) are re-deployed during metamorphosis 
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Figure 5.1: Estimated mean expression curves and 95% Bayesian confidence intervals 
(grey bands) for cluster A, B and C (from top to bottom) discovered in the worm-fly 
temporal expression data. Vertical solid lines separate worm (eggs, larva and young 
adult are separately by dash lines in the left frame), fruit-fly (embryogenesis, larva, 
pupa, and adult stages are separated by dash lines in the right frame). Adult fruit-fly 
male and female mean expression curves are labeled as M and F, respectively. 

(the transition from larva to adult). Consistently, this cluster is enriched for genes 
involved in embryonic development (P- value =0.0003), post-embryonic body morpho- 
genesis (P-value =0.007), and mRNA processing (P- value = 0.002), among others. 

In cluster B, consisting of 24 genes, gene expressions of worms increase start- 
ing at eggs until they reach a peak at late larval stage. Then expressions go down 
during adulthood. However, we observed that fruit-fly gene expressions that are 
down-regulated during embryogenesis are up-regulated during metamorphosis and 
adult, suggesting that many genes are involved in development. The enriched gene 
functions are embryonic (P-value =0.02), larval development (P-value =0.008), and 
growth regulation (P-value < 10 -5 ). 

Cluster C contains 25 genes. For worms, gene expressions have peaks at larva 
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and adult stages. An over-representation of gene functions such as reproduction 
(P- value < 10~ 6 ), larval development (P- value < 10~ 7 ). Fruit-flies show peaks in 
gene expression in the early embryo, and older females (but not males). An over- 
representation of gene functions such as reproduction (P-value < 10~ 6 ) and embryonic 
development (P- value < 10~ 5 ) are present in this cluster. Among related functions, 
this cluster also contains functions of female gamete generation, growth, and positive 
regulation of growth rate. Genes of this cluster are thus inferred to participate in sex 
determination, female production of eggs, and growth regulation. 



5.2 Budding Yeast Gene Expression under Aerobic and Anaer- 
obic Conditions 



To study the oxygen-responsive gene networks, Lai et al. (2006) used cDNA microar- 
ray to monitor the gene expression changes of wild-type budding yeast (Saccharomyces 
cerevisiae) under aerobic and anaerobic conditions in a galactose medium. Under the 
aerobic conditions, the oxygen concentration was lowered gradually until oxygen was 
exhausted during a period of ten minutes. After 24 hours of anaerobiosis, the oxygen 
concentration was progressively increased back to normal level during another period 
of ten minutes, which was referred to as the anaerobic conditions. Microarray exper- 
iments were conducted at 14 time points under aerobic conditions and 10 time points 
under anaerobic conditions. A reference sample pooled from all time points was used 
for hybridization. 



For their analysis, Lai et al. (2006) normalized gene expressions to gene expres- 
sions of time and filtered out differentially expressed genes. Thus the normalized 
expressions at 23 time points of 2388 differentially expressed genes are used for our 
clustering analysis. We modeled normalized gene expression y i of the ith gene using 
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Figure 5.2: Estimated mean expression curves and 95% Bayesian confidence intervals 
(grey bands) for cluster A, B and C (from top to bottom) discovered in the yeast aero- 
bic and anaerobic expression data. The aerobic (left) and anaerobic (right) conditions 
were separated by two vertical lines. 



the mixture model, 



y i = n k (t, t) + bil + £j with probability p k 



where k = 1, • • • , K, r = 1 for aerobic and r = 2 for anaerobic condition, 6j ~ 
N(0, of) is the gene specific random effect. We fit the model using the penalty ( 12.51) 
with a = 2. In total, 2388 genes were clustered into 28 clusters using our method. 
FunSpec ( IRobinson et al. 200211 was used for gene annotation and biological function 
enrichment analysis. We found 26 clusters out of 28 clusters discovered have over- 
represented biological functions. The estimated mean gene expression profiles and 
associated Bayesian confidence intervals of three clusters are given in Figure 15.21 

In cluster A, which consists of 57 genes, the estimated mean expression goes down 
progressively as oxygen level goes down, which suggests that the genes in this cluster 
are transiently down-regulated in response to anaerobisis. Furthermore, the estimated 
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mean expression increases as oxygen concentration shifts back to normal level. Ac- 
cordingly, genes involved in respiration, lipid fatty-acid and isoprenoid biosynthesis, 
and cell defense are over-represented in this cluster (P- value < 1CT 5 ). 

In contrast to cluster A, cluster B (85 genes) consists of genes involved in vari- 
ous biosynthesis, metabolism and catabolism such as glucose metabolism (P-value< 
1CT 6 ). These biological processes are necessary to maintain the basic living needs of 
yeast cells. Interestingly, the alcohol biosythesis and metabolism are also enriched in 
this cluster. Consistent with biological function over-representation, the estimated 
mean expression is up-regulated in aerobic conditions and down-regulated in anaero- 
bic conditions. 

We have 70 genes in cluster C, where the estimated mean gene expression goes 
up at the beginning and then drops down rapidly under aerobic conditions. Under 
anaerobic conditions, the estimated mean gene expression is up-regulated. In this 
cluster, respiratory deficiency and carbon utilization are also over-represented (P- 
value< 1CT 8 ). The initial up-regulation of gene expression under aerobic conditions 
can be partly explained by the fact that the cell increases energy up-taking through 
other biological processes, such as carbon utilization, when oxygen goes down. But 
as the oxygen level continues to drop, these processes are replaced by more energy 
efficient processes, such as glucose metabolism. Under the anaerobic conditions, these 
processes are revitalized again as oxygen level increases. 

6 Discussion 

In this article, we propose a clustering method for large scale functional data with mul- 
tiple covariates. Nonparametric mixed-effect models were built, which were nested 
under a mixture model. The penalized Henderson's likelihood was employed for 
estimation. Data-driven smoothing parameters, selected through generalized cross- 
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validation, were used to automatically capture the functional features. The rejection- 
controlled EM algorithm was designed to reduce the expensive computational cost for 
large scale data. The simulation analyses suggest that the proposed method outper- 
forms the existing clustering methods. Moreover, the Bayesian interpretation of the 
proposed method allows the development of an equivalent fully Bayesian functional 
data clustering method, which can accommodate additional genomic and proteomic 
information for gene expression study. Although it was motivated for clustering 
temporal expression data, our proposed method has a wide spectrum of applica- 
tions, including those involving seismic wave data arising from geophysical research 
(Wang et al. 2006 and lMa et al. 20 07). The calculations reported in this article were 
performed in R. Open-source code is available in the R package MFDA. 

As a sequel to this work, a clustering method for discrete data, especially those 
arising from temporal text mining, is under active development. 



Appendix 

Proof of Theorem \2.1\ : 

Note the fact that if we specify the prior for fo as a Gaussian process with mean 
zero and covariances E[f (sk)fo(si)] = r 2 Y^=i 0v( s fc)0i/( s «)) then when r 2 — > oo, the 



prior for / becomes a diffuse prior; see Wahba (1983) and Gu (2002) 



Assuming fo(t) has a Gaussian process prior specified above, fi(x) has a Gaussian 



process prior specified as in Theorem I2.1[ and b follows a normal distribution with 
mean zero and variance-covariance matrix B, we can derive that the joint distribution 
of y and fo(x)+fi(x)+z T h follows a Gaussian distribution with mean and covariance 
matrix 

' 6FV+F T + r 2 SS T + a 2 I 6FV+£ + r 2 S(p 



bfv+F T + r 2 T S T bfv+$, + r 2 4> T (f> 



(6.1; 
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where £ = (i?i(si, x), . . . , Ri(st, x), z) t is (T + p) x 1, is m x 1 with the vth 
entry (j) u (t), F = (R, Z), and V^ 4 " is the Moore-Penrose inverse of V = diag(Q, 
satisfying VV+F T = F T . 

Standard calculation yields 



E[p(x) + z T b\y] = (b£ V+F T +T 2 cj> r S T )(bFV + F T + r 2 SS T + a 2 I)- 1 y 
= p0 T S T (W + pSS T )- 1 y + ~fv + F T {W + P SS T )- 1 y, 



where p = r 2 /b, NX = cr 2 /6, and W = FV + F T + NXI. Now letting p — > oo, we have 
lim (pSS T + W)- 1 = W- 1 - W- 1 S(S T W- 1 S)- 1 S T W- 1 , (6.2) 

p^oo 

lim pS T (pSS T + W)- 1 = (S^W^S)^ 1 ^ 1 . (6.3) 

p^oo 



See Wahba (1983) and Gu (2002) for the proof 



Therefore, linv^^ E\ji{x) + z T b|?/] = </> d + £ c, where 

d = (S T W- 1 S)- 1 S T W _1 i/, c = V+F T (W- 1 - W- 1 S(S T W- 1 S)^ 1 S T W- 1 ) 2/ . (6.4) 

It is straightforward to verify that the d and c given in ( 16.4f) satisfy ( 12. 8ft . 
Proof of Theorem \2.S\ : 

The posterior variance can be easily calculated by using expression (16. ip as follows, 

var[ / u(x)+z T b|y] = fv+^+pcf) 1 0-(£ T V + F T +p0 T S T )(W+pSS r )- 1 (FV+£+pSc£) 

Notice that lim^ P I-p 2 S T {pSS T +W)- l S = (S T W~ 1 S)- 1 , and VV+F T = F T . 
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Therefore as p — > oo, we have 



lim Var[fi{x)+z T b\y]/b = ^V+^+^^W^SrV-S^^W^Sr^W^FV+l 

r 2 ^oo 

- ^ T V + F T (W- 1 - W- 1 S(S T W- 1 S)- 1 S T W- 1 )FV+|. 
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